Abstract

Scheduling techniques are typically developed for specific industries and
applications through extensive interviews with domain experts to codify
effective heuristics and solution strategies. As an alternative, we present
a technique called Collaborative Optimization via Apprenticeship Scheduling
(COVAS), which performs machine learning using human expert demonstration,
in conjunction with optimization, to automatically and efficiently produce
optimal solutions to challenging real-world scheduling problems. COVAS first
learns a policy from human scheduling demonstration via apprenticeship
learning, then uses the solution generated by this policy to provide a tight
bound on the value of the optimal solution, thereby substantially improving
the efficiency of a branch-and-bound search for an optimal schedule. We
demonstrate this technique on a variant of the weapon-to-target assignment
problem, and show that it generates substantially superior solutions to
those produced by human domain experts, at a rate up to approximately 10
times faster than an optimization approach that does not incorporate human
expert demonstration.

Introduction 

Scheduling is a costly problem for many industries, both with regard to the
effort required to develop a solution technique and the time necessary to
produce a schedule. However, attempts to take "shortcuts" within the
scheduling process may yield low-quality schedules that result in wasted
resources. Traditionally, scheduling techniques are developed for specific
industries and applications by consultants who conduct extensive interviews
with the domain experts who manually or semi-manually perform scheduling
tasks. The goal of these interviews is to codify effective heuristics and
strategies for the problem, in order to then craft efficient automated
scheduling techniques. This process is time-consuming and largely manual,
because while domain experts can readily explain the key aspects or features
of their problem-solving strategy, they are not typically able to precisely
describe how they use those features when making decisions (Cheng, Wei, and
Tseng 2006; Raghavan, Madani, and Jones 2006).

In this work, we propose Collaborative Optimization via Apprenticeship
Scheduling (COVAS), an approach that incorporates machine learning from
human expert demonstration, in conjunction with optimization, to
automatically and efficiently produce optimal solutions to challenging
real-world scheduling problems. Our method performs policy learning using a
training dataset composed of schedules demonstrated by humans, as well as a
recently developed technique for apprenticeship scheduling (Gombolay et al.
2016a). In prior work, the technique was proposed in order to simply emulate
human expert scheduling decisions; in this work, we use the apprenticeship
scheduler to generate a favorable (if suboptimal) initial solution to a new
scheduling problem. To guarantee that the generated schedule is serviceable,
we augment the apprenticeship scheduler to solve a constraint satisfaction
problem, ensuring that the execution of each scheduling commitment does not
directly result in infeasibility for the new problem. COVAS uses this
initial solution to provide a tight bound on the value of the optimal
solution, substantially improving the efficiency of a branch-and-bound
search for an optimal schedule.

We demonstrate our approach by solving a real-world anti-ship missile
defense problem, and report that COVAS produces substantially superior
solutions to those produced by human domain experts, at a rate twice as fast
as an optimization approach that does not incorporate human expert
demonstration.

Related  Work 

Recent research has aimed to capture goal-based knowledge obtained through
demonstration via a process known as reward learning (Abbeel and Ng 2004;
Berry et al. 2011; Ijspeert, Nakanishi, and Schaal 2002; Konidaris,
Osentoski, and Thomas 2011; Zheng, Liu, and Ni 2015; Odom and Natarajan
2015; Terrell and Mutlu 2012; Thomaz and Breazeal 2006; Vogel et al. 2012;
Ziebart et al. 2008). Inverse reinforcement learning (IRL), the most common
approach, learns a reward function to capture the intent of the
demonstrators and then trains a policy via reinforcement learning to
maximize that reward function. However, as noted in prior works (Gombolay et
al. 2016a; Wu et al. 2011; Wang and Usher 2005; Zhang and Dietterich 1995),
the large amount of data required to regress over the large state spaces
associated with scheduling problems remains daunting, and RL-based
scheduling solutions exist only for simple problems (Wu et al. 2011; Wang
and Usher 2005; Zhang and Dietterich 1995).

An alternative approach specially designed for meeting scheduling (Berry et
al. 2011) requires users to complete an extensive questionnaire in order to
solicit their preferences for scheduling meetings. This technique then maps
those preferences to an objective function and solves for the optimal
meeting schedule via a mixed-integer linear program (MILP). However, this
approach is limited to small problems that can be efficiently solved as an
integer linear program (Berry et al. 2011). State-of-the-art techniques for
solving scheduling problems with complex temporal constraints via integer
linear programs are limited to problems involving five agents and 50 tasks,
at most (Cire, Coban, and Hooker 2013).

Another approach, called policy learning, focuses on learning a mapping from
states to actions (Chernova and Veloso 2007; Huang and Mutlu 2014; Sammut et
al. 1992; Ramanujam and Balakrishnan 2011). This technique has been applied
to learn cognitive decision-making tasks from human experts, such as
determining an airport runway configuration (Ramanujam and Balakrishnan
2011). Similarly, the learning system AlphaGo incorporates an initial
policy-learning phase (Silver et al. 2016). The AlphaGo framework began by
solving a supervised policy learning problem to imitate the decision-making
of human Go players. AlphaGo's policy was then improved through self-play
using a policy gradient algorithm (Sutton et al. 1999). This approach is
promising for solving scheduling problems by learning policies through
expert demonstration. However, we are unaware of any prior attempts to apply
policy learning to the scheduling domain, other than work by Gombolay et al.
(2016a), which aimed to simply emulate human expert scheduling policies,
rather than improve upon them. Techniques that rely upon function
approximation and policy gradient descent or variants of Q-learning (such as
the framework employed by AlphaGo) are less desirable for scheduling
applications, for two reasons: it is inefficient to specify or learn complex
temporal constraints, which are often non-Markovian, and the solution
techniques are only guaranteed to converge to a locally optimal solution.

A small number of prior works have pursued approaches outside of the family
of techniques for policy learning. Some directly modeled the trustworthiness
of the demonstrations via robust Bayesian inverse reinforcement learning
(Zheng, Liu, and Ni 2015). For example, Zheng et al. showed that their
approach was better able to capture the ground-truth objective function from
imperfect training data than regular Bayesian IRL (Ramachandran and Amir
2007), which does not include a trustworthiness parameter for
demonstrations. Zheng et al. validated their approach using a synthetic
dataset in an experiment with the goal of identifying the best route through
an urban domain.

Banerjee et al. addressed a domain in which the system was required to
repeatedly solve a scheduling problem wherein the variables remained the
same, but the constraints for those variables changed (Banerjee et al.
2011). Using a MILP formulation, they proposed a machine
learning-optimization pipeline in which the system performed a
branch-and-bound search over the integer variables, and used the prediction
of a regression algorithm trained on examples of previously solved problems
to provide a provable lower bound on the optimality of the current integer
variable assignments. A shortcoming of this approach is its reliance upon
the ability to generate a large database of solutions to train the
regression algorithm. This generation requires the costly exercise of
repeatedly solving a large set of MILPs.

Figure 1: The COVAS architecture.

The technique presented in this paper was inspired by these prior works,
which synthesize machine learning techniques, optimization and human
demonstration. To our knowledge, our work is the first to develop and
demonstrate an approach to learning through human demonstrations to
efficiently produce optimal solutions for complex real-world scheduling
problems. Our method employs a policy learning phase to learn from human
demonstration, and uses the resulting policy to generate an initial solution
that provides a tight bound on the value of the optimal solution. We show
that this policy can be used in conjunction with a MILP solver to
substantially improve the efficiency of a branch-and-bound search for an
optimal schedule. Our work is distinguished from prior works that
incorporated policy gradient descent or variants of Q-learning in that COVAS
is guaranteed to produce a globally optimal solution to the scheduling
problem. Also, COVAS can be employed as an anytime algorithm that provides a
bound on the sub-optimality of the solution.

Model  for  Collaborative  Optimization  via 
Apprenticeship  Scheduling 

Here,  we  provide  an  overview  of  the  COVAS  architecture, 
and  then  present  its  two  components:  the  policy  learning 
and  optimization  routines. 

COVAS  Architecture 

Figure  1  depicts  an  overview  of  the  COVAS  framework. 

The system takes as input a set of domain expert scheduling demonstrations
(e.g., Gantt charts, as shown in Figure 1) that contains information
describing which agents complete which tasks, when and where. These
demonstrations are passed to an apprenticeship scheduling algorithm that
learns a classifier, $f_{priority}(\tau_i, \tau_j)$, to predict whether the
demonstrator(s) would have chosen scheduling action $\tau_i$ over action
$\tau_j \in \boldsymbol{\tau}$.

Next, COVAS uses $f_{priority}(\tau_i, \tau_j)$ to construct a schedule for
a new problem. COVAS creates an event-based simulation of this new problem
and steps the simulation forward in time until all tasks have been
completed. In order to complete tasks, COVAS uses
$f_{priority}(\tau_i, \tau_j)$ at each moment in time to select the best
scheduling action to take. We describe this process in detail in the next
section.

Next, COVAS provides this output as an initial seed solution to an
optimization subroutine (i.e., a MILP solver). The initial solution produced
by the apprenticeship scheduler improves the efficiency of a search by
providing a bound on the objective function value of the optimal schedule.

Here, we briefly review the basic technique for solving a MILP; for a full
overview, we refer the reader to (Bertsimas and Weismantel 2005). In
general, solving a MILP requires iteratively identifying ever-tighter upper
and lower bounds for a given problem in order to inform a branch-and-bound
search over the integer variables. For a minimization problem, any solution
that satisfies the constraints of the MILP, $Ax \le b$, including the
integrality constraints, provides an upper bound. To identify a lower bound,
one can solve a linear relaxation of the problem. Such a relaxation can be
computed quickly; however, it rarely results in a feasible solution. As each
new upper- and lower-bound solution is found, the algorithm is able to prune
areas of the search tree and focus its search on areas that can yield the
optimal solution. After the algorithm has identified an upper and lower
bound within some threshold of each other, COVAS returns the solutions that
have been proven optimal within that threshold. Thus, an operator can use
COVAS as an anytime algorithm and terminate the optimization upon finding a
solution that is acceptable within a provable bound.
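The anytime termination rule described above can be illustrated with a minimal Python sketch. It assumes a minimization problem and a relative gap measure; the function names `relative_gap` and `anytime_stop`, and the stream of bound pairs, are hypothetical and are not taken from the COVAS implementation.

```python
from typing import Iterable, Optional, Tuple


def relative_gap(upper_bound: float, lower_bound: float) -> float:
    """Relative optimality gap for a minimization problem.

    upper_bound: objective value of the best feasible solution found so far.
    lower_bound: best bound proven by relaxations so far.
    """
    if upper_bound == lower_bound:
        return 0.0
    # Guard against division by zero when the incumbent value is 0.
    return (upper_bound - lower_bound) / max(abs(upper_bound), 1e-9)


def anytime_stop(
    bound_updates: Iterable[Tuple[float, float]], threshold: float
) -> Optional[Tuple[int, float, float]]:
    """Scan (upper, lower) bound pairs in the order a solver reports them.

    Returns (step, incumbent_value, gap) for the first step at which the
    incumbent is provably within `threshold` of optimal, or None if the
    stream ends before that happens.
    """
    for step, (upper, lower) in enumerate(bound_updates):
        gap = relative_gap(upper, lower)
        if gap <= threshold:
            return step, upper, gap
    return None


# Hypothetical bound trajectory: the gap shrinks as the search proceeds.
result = anytime_stop([(100.0, 50.0), (90.0, 80.0), (85.0, 84.0)], 0.05)
```

In this sketch the operator's acceptance threshold is 5%, so the search would stop at the third update, when the incumbent value 85 is within about 1.2% of the proven lower bound.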

Apprenticeship  Scheduling  Subroutine 

In this section, we review the apprenticeship scheduling subroutine for
COVAS. Our approach incorporates policy learning using a training dataset
composed of schedules demonstrated by humans, as well as a recently
developed technique for apprenticeship scheduling. The apprenticeship
scheduling algorithm (Gombolay et al. 2016a) takes as input demonstrations
in which human experts manually solve randomly generated variants of a
real-world scheduling problem. The apprenticeship scheduler has been shown
in empirical evaluation to learn a policy that effectively emulates the
human expert scheduling policy in a new problem variant. In this work, we
use the apprenticeship scheduler to generate a favorable (if suboptimal)
initial solution to a new scheduling problem. To guarantee that the
generated schedule is serviceable, we augment the apprenticeship scheduler
to solve a constraint satisfaction problem in order to ensure that each
scheduling commitment does not directly result in infeasibility for the new
problem through the execution of that action.

Consider a scheduling problem containing a set of tasks $\mu \in M$, agents
$a \in A$ and locations to complete tasks $x \in X$, as well as a set of
scheduling actions taken at each moment in time $\tau_i = (\mu, a, x, t)$.
For each action taken, $\tau_i$, the learning system can also compute the
set of actions not taken, $\tau_j \in \boldsymbol{\tau}$. Each tuple has an
associated real-valued feature vector, $\gamma_{\tau_i}$. Features of this
vector may include the deadline for $\mu$, the distance from the agent's
current location to $x$ or how quickly the agent is able to complete the
task. The system allows, for a given moment in time, all possible tuples to
share a common pointwise feature vector, $\xi_{\tau}$, which captures
features that are not well described by pairwise comparisons, such as the
proportion of completed tasks or of idle agents.
These vectors serve to create the training data, as shown in Equations 1-2.
For each scheduling observation (i.e., a specific time point within a
schedule), the system creates a set of positive and negative examples. To
create a positive example, we take the feature vector of the action taken,
$\gamma_{\tau_i}$, less the feature vector of an action not taken,
$\gamma_{\tau_j}$; concatenate to that difference the pointwise vector,
$\xi_{\tau}$; and assign that example a positive label (Equation 1). To
create a negative example, we take the feature vector of an action not
taken, $\gamma_{\tau_j}$, less the feature vector of the action taken,
$\gamma_{\tau_i}$; concatenate to that difference the pointwise vector,
$\xi_{\tau}$; and assign that example a negative label (Equation 2). We
create one positive and one negative example for each action not taken,
$\tau_j$, for each observation. Finally, we train a classifier on these
examples to learn a priority function, $f_{priority}(\tau_i, \tau_j)$, in
order to predict whether scheduling action $\tau_i$ is better or worse than
$\tau_j$. The computational complexity of the algorithm vis-a-vis Equation 3
is $O(|\boldsymbol{\tau}|^2 d)$ per time step, where $d$ is the maximum
depth of the decision tree (Gombolay et al. 2016a).
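The construction of pairwise training examples described above can be sketched as follows. This is an illustrative reading of the prose, not the authors' code: the function name `pairwise_examples`, the use of plain Python lists for feature vectors, the labels 1 and 0, and the choice to place the pointwise vector after the difference are all assumptions.

```python
from typing import List, Sequence, Tuple

Example = Tuple[List[float], int]


def pairwise_examples(
    gamma_taken: Sequence[float],
    gamma_not_taken: Sequence[Sequence[float]],
    xi: Sequence[float],
) -> List[Example]:
    """Build one positive and one negative example per action not taken.

    gamma_taken: feature vector of the action the demonstrator chose.
    gamma_not_taken: feature vectors of the actions that were available
        at the same observation but not chosen.
    xi: pointwise feature vector shared by all actions at this observation.
    """
    examples: List[Example] = []
    for gamma_j in gamma_not_taken:
        # Positive example: (taken - not taken), then xi, labelled 1.
        positive = [a - b for a, b in zip(gamma_taken, gamma_j)] + list(xi)
        # Negative example: (not taken - taken), then xi, labelled 0.
        negative = [b - a for a, b in zip(gamma_taken, gamma_j)] + list(xi)
        examples.append((positive, 1))
        examples.append((negative, 0))
    return examples


# Hypothetical observation: two features per action, one pointwise feature.
data = pairwise_examples([3.0, 1.0], [[1.0, 1.0], [5.0, 0.0]], [0.5])
```

Any binary classifier could then be trained on `data`; the source reports using a decision tree, whose maximum depth is the $d$ in the complexity bound above.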

In this work, the learned policy $f_{priority}(\tau_i, \tau_j)$ is applied
to obtain the initial solution to a new scheduling problem as follows:
First, the user must instantiate a simulation of the scheduling domain;
then, at each time step in the simulation, take the scheduling action
predicted by Equation 3 to be the action that the human demonstrators would
take. This equation identifies the action $\tau_i$ with the highest
importance marginalized over all other actions
$\tau_j \in \boldsymbol{\tau}$.

Each selected action is then validated using a schedulability test (i.e.,
solving a constraint satisfaction problem) to ensure that direct application
of that action does not violate the constraints of the new problem. For
example, in anti-ship missile defense, one would check to ensure that the
action does not result in a suicidal deployment (i.e., the decoy directly
causes a missile to impact the ship). The test must be designed to be fast
(e.g., polynomial complexity) so as to make the benefit to feasibility and
optimality in the resulting schedule worth the additional complexity. If, at
a given time step, $\tau_i^*$ does not satisfy the schedulability test,
COVAS uses Equation 3 for all
$\tau_i \in \boldsymbol{\tau} \setminus \tau_i^*$ in order to consider the
second-best action. If no action $\tau_i \in \boldsymbol{\tau}$ passes the
schedulability test, no action is taken during that time step.
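A minimal sketch of this selection step, under stated assumptions, is given below. It scores each candidate action by summing the pairwise priority over all other actions (one reading of Equation 3), then walks down the ranking until an action passes the schedulability test. The names `select_action`, `f_priority` and `is_schedulable` are placeholders; the real classifier and constraint satisfaction check are domain-specific and are not reproduced here.

```python
from typing import Callable, Hashable, Optional, Sequence


def select_action(
    actions: Sequence[Hashable],
    f_priority: Callable[[Hashable, Hashable], float],
    is_schedulable: Callable[[Hashable], bool],
) -> Optional[Hashable]:
    """Return the highest-ranked action that passes the schedulability test.

    Each action's score is the sum of f_priority(action, other) over all
    other candidate actions. Returns None when no candidate is schedulable,
    which corresponds to taking no action during this time step.
    """
    scores = {
        a: sum(f_priority(a, b) for b in actions if b != a) for a in actions
    }
    # sorted() is stable, so ties keep the original order of `actions`.
    for action in sorted(actions, key=lambda a: scores[a], reverse=True):
        if is_schedulable(action):
            return action
    return None


# Toy stand-ins: "a" outranks "b" outranks "c", but "a" is infeasible.
toy_priority = lambda x, y: 1.0 if x < y else 0.0
chosen = select_action(["a", "b", "c"], toy_priority, lambda x: x != "a")
```

Here the top-ranked action fails the test, so the sketch falls back to the second-best action, mirroring the fallback described in the text.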

While the schedulability test forces the apprenticeship scheduling algorithm
to follow a subset of the full constraints in the MILP formulation, it is
possible that the algorithm may not successfully complete all tasks.
However, our MILP formulation is flexible in such cases, as we present in
the next section. Here, we model tasks as optional and use the objective
function to maximize the total number of tasks completed. In turn,
constraints for a task that the apprenticeship scheduling algorithm did not
satisfactorily complete can be turned off, with a corresponding penalty in
the objective function score. Thus, an initial seed solution that has not
completed all tasks (i.e., satisfied all constraints to complete the task)
can still be helpful for seeding the MILP.

$$\tau_i^* = \operatorname*{argmax}_{\tau_i \in \boldsymbol{\tau}} \sum_{\tau_j \in \boldsymbol{\tau}} f_{priority}(\tau_i, \tau_j) \tag{3}$$

Optimization  Subroutine 

For optimization, we employ mathematical programming techniques to solve
mixed-integer linear programs via branch-and-bound search. COVAS
incorporates the solution produced by the apprenticeship scheduler to seed a
mathematical programming solver with an initial solution. This is a built-in
capability provided by many off-the-shelf, state-of-the-art MILP solvers,
including CPLEX and Gurobi. This seed provides a tight bound on the value of
the optimal solution, which serves to dramatically cut the search space,
allowing the system to more quickly home in on the area containing the
optimal solution and, in turn, more quickly solve the optimization problem.
Furthermore, this approach allows COVAS to quickly achieve a bound on the
optimality of the solution provided by the apprenticeship scheduling
subroutine. In such a manner, an operator can determine whether the
apprenticeship scheduling solution is acceptable or whether waiting for
successive solutions is warranted.
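To make the effect of a seed solution concrete, the following self-contained sketch runs a depth-first branch-and-bound on a toy assignment problem, with and without an initial incumbent. It is not the ASMD formulation and does not call CPLEX or Gurobi; the cost matrix, the greedy heuristic standing in for the apprenticeship scheduler, and the row-minimum relaxation are all illustrative assumptions.

```python
import math
from typing import List, Optional, Tuple


def greedy_seed(cost: List[List[float]]) -> List[int]:
    """Stand-in for a learned policy: each row takes its cheapest free column."""
    used, seed = set(), []
    for row in cost:
        free = [c for c in range(len(row)) if c not in used]
        col = min(free, key=lambda c: row[c])
        used.add(col)
        seed.append(col)
    return seed


def branch_and_bound(
    cost: List[List[float]], seed: Optional[List[int]] = None
) -> Tuple[float, Optional[List[int]], int]:
    """Minimize total assignment cost; return (value, assignment, nodes)."""
    n = len(cost)
    row_min = [min(row) for row in cost]  # relaxation: ignore column clashes
    best_value, best_assign = math.inf, None
    if seed is not None:
        # The seed is a feasible solution, so its value is an upper bound.
        best_assign = list(seed)
        best_value = sum(cost[i][j] for i, j in enumerate(seed))
    nodes = 0

    def search(row: int, used: set, partial: List[int], value: float) -> None:
        nonlocal best_value, best_assign, nodes
        nodes += 1
        lower = value + sum(row_min[row:])
        if lower >= best_value:  # cannot improve on the incumbent: prune
            return
        if row == n:
            best_value, best_assign = value, list(partial)
            return
        for col in range(n):
            if col not in used:
                used.add(col)
                partial.append(col)
                search(row + 1, used, partial, value + cost[row][col])
                partial.pop()
                used.remove(col)

    search(0, set(), [], 0.0)
    return best_value, best_assign, nodes


costs = [[4.0, 2.0, 8.0], [4.0, 3.0, 7.0], [3.0, 1.0, 6.0]]
plain_value, _, plain_nodes = branch_and_bound(costs)
seeded_value, seeded_assign, seeded_nodes = branch_and_bound(
    costs, seed=greedy_seed(costs)
)
```

Both runs return the same optimal value, but the seeded run can never visit more nodes than the unseeded one, because its incumbent is at least as tight at every point in the same search order. On a three-by-three instance the saving is small; the benefit reported in the paper concerns far larger search trees.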

Methodology  for  Evaluation  of  COVAS  with  a 
Real-World  Scheduling  Domain 

Here, we demonstrate COVAS in the context of a real-world anti-ship missile
defense (ASMD) problem. First, we formally define the problem, a variant of
the well-studied weapon-to-target assignment problem (Ahuja et al. 2007),
and outline its usefulness as an appropriate test domain for COVAS.

Overview  of  Anti-Ship  Missile  Defense  Problem 

In ASMD, the goal is to protect one's naval vessel against attacks by
heterogeneous anti-ship missiles. Recent technological advances in
electronic warfare have prompted the development of what are known as
"soft-kill" weapons (i.e., decoys) that mimic the qualities of a target in
order to direct the missile away from its intended destination.

Developing tactics for soft-kill weapon coordination is highly difficult due
to the relationship between missile behavior and the characteristics of
soft-kill weapons. The control laws governing anti-ship missiles are varied,
and the captain must select the correct decoy types in order to counteract
the associated anti-ship missiles. Further, decoys have different financial
costs and timing characteristics: Some decoys, such as unmanned aerial
vehicles (UAVs), are able to function during the entire engagement, while
others, such as infrared (IR) flares, evaporate after a certain time. In
turn, a captain may be required to use multiple decoys in tandem in order to
divert a single anti-ship missile. Moreover, there is a complex interplay
between the types and locations of decoys relative to the control laws
governing anti-ship missiles. For example, deployment of a particular decoy,
while effective against one airborne enemy missile, may actually cause a
second enemy missile that was previously homing in on a second decoy to now
impact the ship when it would have missed otherwise.

The ASMD problem is characterized as the most complex class of scheduling
problem according to the Korsah et al. taxonomy (Korsah, Stentz, and Dias
2013): XD [MA-MT-TA]. The problem considers multi-task agents (MA) in the
form of decoys, each of which can work to divert multiple missiles at the
same time. The problem also considers multi-agent tasks (MT): a feasible
solution may require the simultaneous use of multiple agents in order to
complete an individual task. Further, time-extended agent allocation (TA)
must be taken into consideration, given the potential future consequences of
scheduling actions taken at the current moment. Finally, the ASMD problem
falls within the XD class, because each task may be decomposed in a variety
of ways, each with its own cost, in order to accomplish the same goal, and
each decomposition affects the value and feasibility of the decompositions
of other tasks.

ASMD  Problem  Formulation 

The ASMD can be formally modeled as follows: We first define a task
$m \in M$ as the job of defending a ship against an individual missile;
successfully completing all tasks results in diverting all enemy missiles.
We also define an agent $d \in D$ (i.e., a decoy) as an actor used to aid in
accomplishing tasks. A scheduling assignment is then represented by a
four-tuple $(d, m, x, t)$, where $d$ is a decoy, $m$ is the associated
missile, $x$ is the location (in Cartesian coordinates) of the decoy
relative to the ship and $t$ is the moment when the decoy should be
deployed.

As ASMD is a time-extended problem, the formulation must discretize time.
However, note that the granularity with which the task of protecting the
ship from a given missile is decomposed as a function of time is a modeling
choice with ramifications for the quality and computation time of a
solution. Consider a missile $m$ that will hit a ship if it tracks the ship
in some time interval $[t, t')$ for a duration $dt = t' - t$. The captain
might, at time $t$, deploy a decoy $d$, such as a hovering UAV, that is able
to last the entire duration $dt$. However, it may be preferable to deploy
one or more decoys $d'$, each of which remains active for a portion of the
specified time interval. Furthermore, in a situation wherein another missile
$m'$ is launched before $m$, it may be best to have a decoy deployed before
$t$ that can divert both $m$ and $m'$ during part or all of those missiles'
flights.

Because we do not know a priori the best time to deploy a decoy that can be
used for varying portions (i.e., subtasks) of the task of mitigating each
missile, we must decompose the task into sufficiently small time steps.
Discretizing time exponentially increases the search space, and thus the
time to compute the solution; therefore, there is a balance between
optimality (and feasibility) and computation time. In order to generate an
exact solution, we chose the least-common multiple of the time constants,
which is trivially 1, as the unit of time in our simulation.

We formulate the ASMD as a mixed-integer linear program in Equations 4-25:


$$\min z, \quad z = \alpha \sum_{d} c_d U_d - \alpha' \sum_{d,m,t} A_{d,m,t} - \alpha'' \sum_{m} V_m \tag{4}$$

$$A_{d,m,t} \le A_{d,t}, \quad \forall d, m, t \tag{5}$$

$$A_{d,m,t} \le U_{d,m}, \quad \forall d, m, t \tag{6}$$

$$X_{d,l} \le U_d, \quad \forall d, l \tag{7}$$

$$S^{decoy}_{d} \le S^{decoy}_{d,m}, \quad \forall d, m \tag{8}$$

$$S^{decoy}_{d,m} \le t + M(1 - A_{d,m,t}), \quad \forall d, m, t \tag{9}$$

$$F^{decoy}_{d,m} \le F^{decoy}_{d}, \quad \forall d, m \tag{10}$$

$$t A_{d,m,t} \le F^{decoy}_{d,m}, \quad \forall d, m, t \tag{11}$$

$$M(U_{d,m} - 1) \le S^{decoy}_{d,m} - F^{decoy}_{d,m} - 1 + \sum_{t} A_{d,m,t} \le M(1 - U_{d,m}) \tag{12}$$

$$M(U_d - 1) \le F^{decoy}_{d} - S^{decoy}_{d} - dt^{evap}_{d} \le M(1 - U_d) \tag{13}$$

$$U_{d,m} \le \sum_{l \,\mid\, m \text{ seduced by decoy } d \text{ in location } l} X_{d,l}, \quad \forall d, m \tag{15}$$

$$1 = \sum_{d} A_{d,m,t} + \sum_{g} G_{g,m,t}, \quad \forall m, t \tag{16}$$

$$S^{decoy}_{d} - ETA_m \ge M(X_{d,l} + V_m + J_{d,m} - 3), \quad \forall d, l, m \text{ s.t. decoy } d \text{ in location } l \text{ would cause missile } m \text{ to impact the ship} \tag{17}$$

$$t^{appear}_{m} - F^{decoy}_{d} \ge M(X_{d,l} + V_m - J_{d,m} - 2), \quad \forall d, l, m \text{ s.t. decoy } d \text{ in location } l \text{ would cause missile } m \text{ to impact the ship} \tag{18}$$

$$V_m \le \sum_{d} A_{d,m,t}, \quad \forall m, t \mid t \text{ in critical region for missile } m \tag{19}$$

$$2 \ge A_{d,m,t} + X_{d,l} + X_{d',l'}, \quad \forall d, d', l, l', m, t \text{ s.t. missile } m \text{ is more attracted to decoy } d' \text{ at location } l' \text{ than decoy } d \text{ at location } l \text{ at time } t \tag{20}$$

$$1 \ge A_{d,m,t} + A_{d',m,\dots}, \quad \forall d, d', m, t \text{ s.t. } d \ne d' \text{ and } t \text{ is in a critical region before impact} \tag{21}$$
This formulation incorporates a set of binary decision variables:
$A_{d,m,t} \in \{0, 1\}$ is set to 1 to indicate that decoy $d$ is assigned
to missile $m$ at time $t$, and is 0 otherwise. $A_{d,t} \in \{0, 1\}$ is
set to 1 to indicate that decoy $d$ is assigned to some missile at time $t$,
and is 0 otherwise. $U_{d,m} \in \{0, 1\}$ is set to 1 to indicate that
decoy $d$ is used against missile $m$, and is 0 otherwise.
$U_d \in \{0, 1\}$ is set to 1 to indicate that decoy $d$ is used in the
solution, and is 0 otherwise. $X_{d,l} \in \{0, 1\}$ is set to 1 to indicate
that decoy $d$ is deployed at location $l$, and is 0 otherwise.
$V_m \in \{0, 1\}$ is set to 1 to indicate that missile $m$ has been
effectively diverted, and is 0 otherwise. $G_{g,m,t} \in \{0, 1\}$ is set to
1 to indicate that missile $m$ is tracking the ship at time $t$. A single
missile might have multiple, separate epochs during which it tracks the ship
(e.g., it first tracks the ship, then tracks a decoy, then tracks the ship
again after that decoy evaporates); thus, the program can choose which index
$g$ to represent the various epochs in $G_{g,m,t}$. $J_{d,m} \in \{0, 1\}$
is set to 1 to indicate that decoy $d$ is deployed after missile $m$'s
flight (i.e., after it either hits the ship or is guided astray by a decoy).

The program contains the following set of continuous variables:
$S^{decoy}_{d,m}$ represents the start time of the assignment of decoy $d$
to missile $m$, and $S^{decoy}_{d}$ is the time at which decoy $d$ is
deployed from the ship. Likewise, $F^{decoy}_{d,m}$ represents the finish
time of the assignment of decoy $d$ to missile $m$, and $F^{decoy}_{d}$ is
either the time at which the decoy evaporates or the end of the engagement.
$S^{ship}_{g,m}$ indicates the start time of missile $m$ tracking the ship
during epoch $g$, and $F^{ship}_{g,m}$ indicates the finish time of missile
$m$ tracking the ship during epoch $g$.

The program also includes the following set of constants:
$dt^{re\text{-}target}$ is the duration for which a missile will track a
single target (i.e., decoy or ship) before re-assessing which target is best
to track. Thus, if the missile begins tracking the ship at time $t$, no
decoy can break its lock during the interval
$[t, t + dt^{re\text{-}target})$. $ETA_m$ is the time at which missile $m$
will reach the ship's immediate vicinity. $t^{appear}_{m}$ is the time at
which missile $m$ is first close enough to track the ship. $c_d$ represents
the financial cost of deploying decoy $d$. $\alpha$, $\alpha'$, and
$\alpha''$ are predefined weighting terms for the objective function. The
computational complexity of this formulation is dominated by the integer
variables, which yields

$$O\left(2^{dmt + dm + dt + dl + d + gmt + m}\right)$$

Equation 4 is a multi-criteria objective function that minimizes a weighted,
linear combination of the cost of all decoy deployments, less the total time
during which missiles are tracking decoys and the number of missiles
successfully guided away from the ship.
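As a small worked illustration of how the three terms of this objective trade off, the sketch below evaluates one reading of Equation 4 for a hand-built solution. The function name `objective`, the dictionary and set encodings, and the example weights are hypothetical; only the structure (weighted decoy cost, minus weighted tracking time, minus weighted count of diverted missiles) comes from the description above.

```python
from typing import Dict, Set, Tuple


def objective(
    decoy_cost: Dict[str, float],
    used_decoys: Set[str],
    tracking: Set[Tuple[str, str, int]],
    diverted: Set[str],
    alpha: float,
    alpha_prime: float,
    alpha_double_prime: float,
) -> float:
    """Evaluate the weighted objective for a candidate solution.

    used_decoys: decoys d with U_d = 1.
    tracking: triples (d, m, t) with A_{d,m,t} = 1, one per unit time step.
    diverted: missiles m with V_m = 1.
    """
    cost_term = sum(decoy_cost[d] for d in used_decoys)
    tracking_term = len(tracking)
    diverted_term = len(diverted)
    return (
        alpha * cost_term
        - alpha_prime * tracking_term
        - alpha_double_prime * diverted_term
    )


# One UAV decoy diverts two missiles over three decoy-missile time steps.
z = objective(
    decoy_cost={"uav": 5.0, "flare": 1.0},
    used_decoys={"uav"},
    tracking={("uav", "m1", 0), ("uav", "m1", 1), ("uav", "m2", 1)},
    diverted={"m1", "m2"},
    alpha=1.0,
    alpha_prime=0.5,
    alpha_double_prime=10.0,
)
```

With these illustrative weights, diverting a missile is worth far more than the cost of a decoy, so the minimizer prefers solutions that divert every missile and only then economizes on decoys.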

Equations 5-12 ensure internal consistency between the variables. Equation
13 ensures that a decoy, if deployed, is active for $dt^{evap}_{d}$ units of
time given its timing characteristics. Equation 14 ensures that a decoy is
deployed to no more than one location. Equation 15 ensures that, if a decoy
is deployed against a missile, its deployment location will be a more
attractive target for that missile than the ship. Equation 16 requires that
each missile tracks either a ship or decoy while within range. Equations
17-18 force a decoy, if deployed to a location that would cause missile $m$
to impact the ship, to either be deployed after the missile has already been
diverted or reached the ship (Equation 17) or to be deployed and evaporate
before the missile enters targeting range (Equation 18).

Equation 19 ensures that a missile must be tracking a decoy
in the final seconds before it reaches the vicinity of the
ship, or else the missile will impact the ship. The duration of
this critical period is dependent upon missile dynamics and
the target selection process.

Equation 20 ensures that a missile will select the most
attractive decoy according to that missile’s selection logic.
Equation 21 restricts decoy deployments such that the missile
heading does not “sweep” across the ship in the final
seconds of the missile’s flight. If a missile does not have
enough time to change its direction toward a newly deployed
decoy, that missile will fly into the ship.

Equations 22-25 ensure that epoch g of missile m tracking
the ship lasts exactly as long as the retargeting time for the
missile. Equations 22-23 are akin to Equations 9-11 and relate
the start and finish times of ship-tracking epoch g to the
decision variable $G_{g,m,t}$. Equation 24 is akin to Equation 12
and relates the start and finish times of ship-tracking epoch g
to the decision variable $G_{g,m,t}$.
Equation 25 ensures that the tracking time is
$dt^{\text{re-target}}$ if the missile is airborne for at least
$dt^{\text{re-target}}$ seconds. Otherwise, the tracking time is
equal to the time before impacting the ship (i.e.,
$ETA_m - t - 1$). Finally, a term (i.e., $-MG_{g,m,t-1}$) disables
the constraint for all t, except for the exact moment t at which
the missile begins tracking the ship.
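The duration rule described for Equation 25 can be illustrated with a short Python sketch. It is not the MILP constraint itself: the function name is hypothetical, and combining the two cases with `min` assumes that "airborne for at least the re-target duration" means the time remaining before impact is at least that long.

```python
def ship_tracking_duration(t, eta_m, dt_retarget):
    """Length of a ship-tracking epoch that begins at time t.

    t           -- time at which the missile begins tracking the ship
    eta_m       -- time at which the missile reaches the ship's vicinity
    dt_retarget -- time a missile tracks one target before re-assessing
    """
    time_before_impact = eta_m - t - 1
    # Full re-target duration if there is enough flight time left,
    # otherwise only the time remaining before impact.
    return min(dt_retarget, time_before_impact)


# Hypothetical numbers: a missile arriving at time 20 with a
# re-target duration of 5.
early = ship_tracking_duration(t=0, eta_m=20, dt_retarget=5)
late = ship_tracking_duration(t=17, eta_m=20, dt_retarget=5)
```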

For the apprenticeship scheduling subroutine’s schedulability
test, we apply Equations 17-18 as our constraint satisfaction
check when testing the feasibility of action t*, given
by applying Equation 3. With regard to tasks within the
apprenticeship scheduler’s seed solution that are not
satisfactorily completed, the MILP can leave those tasks incomplete
to start by initially setting $V_m \leftarrow 0$.

Training Dataset for Apprenticeship Scheduling

The algorithm trains the apprenticeship scheduler using a
dataset collected from military domain experts playing a
serious game that emulates the ASMD problem as formulated
in the previous section (Gombolay et al. 2016a). We
considered a specific level within the game that requires players
to defend against a randomized enemy attack. In this level,
10 missiles are fired at the player’s ship from multiple
directions, and the player has access to a limited quantity of five
different types of soft-kill weapons to divert these missiles.
Although the missile bearings and launch times are fixed,
the seeking behavior of the missile is not known a priori.

We collected a dataset of 311 games played by 35 human
players across 45 threat configurations, or “scenarios.” Of
those configurations, only 40 contained a demonstration in
which the player completed an entire round. We then
sub-selected the single best demonstration from each of these 40
scenarios. The demonstrators included ASMD professionals
with expertise ranging from “generally knowledgeable about
the ASMD problem” to “domain experts” with professional
experience or ASMD training.
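The sub-selection step can be sketched as follows. This is a minimal illustration, assuming each game record carries a scenario identifier, a completion flag, and a numeric score where higher is better; the text does not state how the "best" demonstration was determined, and the field names and sample records are hypothetical.

```python
def select_best_demonstrations(games):
    """Keep the highest-scoring completed game for each scenario."""
    best = {}
    for game in games:
        if not game["completed"]:
            continue  # incomplete rounds are not usable demonstrations
        current = best.get(game["scenario"])
        if current is None or game["score"] > current["score"]:
            best[game["scenario"]] = game
    return best


# Hypothetical records; the real dataset has 311 games and 45 scenarios.
games = [
    {"scenario": "A", "player": 1, "score": 40, "completed": True},
    {"scenario": "A", "player": 2, "score": 55, "completed": True},
    {"scenario": "B", "player": 1, "score": 90, "completed": False},
    {"scenario": "C", "player": 3, "score": 10, "completed": True},
]
best = select_best_demonstrations(games)
```

Scenario "B" drops out because its only game was not completed, mirroring how 45 configurations reduce to 40 usable scenarios.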

Figure 2: This figure depicts the total computation time for
COVAS, as well as the amount of time COVAS required to
identify a solution superior to that resulting from a human
expert’s demonstration.

We trained the apprenticeship scheduling algorithm using
the following features: The pointwise features for each
action included the number of decoys of each type left for
possible deployment (i.e., the ammunition). The pairwise
features for each action included, for each decoy/missile
pair (or null decoy deployment due to inaction), indicators
for whether a decoy had been placed such that the missile
was successfully distracted by that decoy, whether the missile
would be lured into hitting the ship due to decoy placement,
or whether the missile would be unaffected by decoy
placement. These features are identical to those employed
in (Gombolay et al. 2016a).
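One way to encode the pointwise and pairwise features described above is sketched below. The `Effect` enum, the function names, and the decoy type names are hypothetical; the original feature encoding is the one given in (Gombolay et al. 2016a) and may differ in detail.

```python
from enum import Enum


class Effect(Enum):
    """Outcome of a decoy placement for one decoy/missile pair."""
    DISTRACTED = 0       # missile successfully distracted by the decoy
    LURED_INTO_SHIP = 1  # placement would lure the missile into the ship
    UNAFFECTED = 2       # missile ignores the decoy


def pointwise_features(ammo, decoy_types):
    """Remaining count of each decoy type, in a fixed order."""
    return [ammo.get(kind, 0) for kind in decoy_types]


def pairwise_features(effect):
    """One-hot indicator vector over the three possible effects."""
    return [1 if effect is e else 0 for e in Effect]


# Hypothetical decoy type names.
decoy_types = ["chaff", "flare"]
ammo_vector = pointwise_features({"chaff": 2}, decoy_types)
pair_vector = pairwise_features(Effect.LURED_INTO_SHIP)
```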

Results  and  Discussion 

In this section, we empirically validate that COVAS is able
to generate optimal solutions more efficiently than
state-of-the-art optimization techniques. As a benchmark, we solve
a pure MILP formulation (Equations 4-25) using Gurobi,
which applies state-of-the-art techniques for heuristic upper
bounds, cutting planes and LP relaxation lower bounds. We
set the optimality threshold at $10^{-3}$.
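As an illustration of what such a threshold means, the sketch below treats it as a relative gap between the best solution found (the incumbent) and the best proven bound, in the style of the relative MIP gap used by solvers such as Gurobi. Interpreting the threshold as a relative gap is an assumption of this sketch, and the function names are hypothetical.

```python
def relative_gap(incumbent, bound):
    """Relative distance between the incumbent value and the best bound."""
    if incumbent == 0:
        return 0.0 if bound == 0 else float("inf")
    return abs(incumbent - bound) / abs(incumbent)


def is_optimal(incumbent, bound, threshold=1e-3):
    """True once the incumbent is proven to be within the threshold."""
    return relative_gap(incumbent, bound) <= threshold


# Hypothetical values for a minimization problem.
close_enough = is_optimal(incumbent=100.0, bound=99.95)
still_open = is_optimal(incumbent=100.0, bound=99.0)
```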

Validation  Against  Expert  Benchmark 

First, we validate that COVAS can efficiently find optimal
solutions, as depicted in Figure 2. To generate each data
point, we trained COVAS’ apprenticeship scheduling
algorithm on demonstrations of experts’ solutions to unique
ASMD scenarios (save for one “hold-out” scenario); we then
tested COVAS on this hold-out scenario. We also applied
a pure MILP benchmark on this scenario and compared the
performance of COVAS to the benchmark. We generated
one data point for each unique demonstrated scenario (i.e.,
leave-one-out cross validation) to validate the benefit of
COVAS.
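The leave-one-out protocol can be sketched as a small generator. The function name and the placeholder scenario labels are hypothetical; training and evaluation themselves are omitted.

```python
def leave_one_out(scenarios):
    """Yield (training_scenarios, held_out_scenario) for every scenario."""
    for i, held_out in enumerate(scenarios):
        training = scenarios[:i] + scenarios[i + 1:]
        yield training, held_out


# Placeholder labels; the study used one demonstration per scenario.
splits = list(leave_one_out(["s1", "s2", "s3"]))
```

Each split would produce one data point: the scheduler is trained on `training`, then COVAS and the MILP benchmark are both timed on `held_out`.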

Figure 2 consists of two performance indicators: The total
computation time required for the MILP benchmark and
COVAS to solve for the optimal solution is depicted on the
left; to the right is the computation time required for the
benchmark and COVAS to identify a solution better than that
given by a human expert. This figure indicates that COVAS
is not only able to improve overall optimization time, but
that it also substantially improves computation time for
solutions that are superior to those produced by human experts.
The average improvement in computation time with COVAS
is 6.7x and 3.1x, respectively.

Figure 3: The total computation time needed for COVAS
and the MILP benchmark to identify the optimal solution
for the tested scenarios.

Next, we evaluate COVAS’ ability to transfer prior learning
to more challenging task sets. We trained on a level in
the ASMD game in which a total of 10 missiles of varying
types came from specific bearings at given times. We
randomly generated a set of scenarios involving 15 and 20
missiles, with bearings and times randomly sampled with
replacement from the set of bearings used in the 10-missile
scenario.
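Generating a larger scenario by resampling from the 10-missile level can be sketched as below. The bearing and launch-time values are hypothetical, and drawing bearings and times independently of each other is an assumption; the text does not say whether they were sampled jointly.

```python
import random


def generate_scenario(bearings, launch_times, n_missiles, rng):
    """Return n_missiles (bearing, launch_time) pairs, each drawn with
    replacement from the values seen in the training level."""
    sampled_bearings = rng.choices(bearings, k=n_missiles)
    sampled_times = rng.choices(launch_times, k=n_missiles)
    return list(zip(sampled_bearings, sampled_times))


# Hypothetical values standing in for the 10-missile level.
base_bearings = [0, 45, 90, 135, 180, 225, 270, 315]
base_times = [0, 5, 10, 15, 20]

rng = random.Random(0)  # fixed seed so the scenarios are reproducible
scenario_15 = generate_scenario(base_bearings, base_times, 15, rng)
scenario_20 = generate_scenario(base_bearings, base_times, 20, rng)
```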

Figure 3 depicts the computation time required by COVAS
and the MILP benchmark to identify the optimal solution
for scenarios involving 10, 15 and 20 missiles. We
found that the average improvement to computation time
with COVAS was 4.6x, 7.9x, and 9.5x, respectively. This
evaluation demonstrates that COVAS is able to efficiently
leverage the solutions of human domain experts to quickly
solve problems twice as large as those the demonstrator
provided for training.

Limitations  and  Future  Work 

COVAS is able to leverage the power of expert scheduling
demonstrations to speed up the computation of provably
globally optimal scheduling solutions. However, the
approach is still limited by the quality of the demonstrations
provided by the experts and the ability of the apprenticeship
scheduling algorithm to generalize the information within
those demonstrations. The MILP’s computation time is
expedited by tight upper bounds (i.e., an initial seed) provided
by the apprenticeship scheduling algorithm. If the
apprenticeship scheduling algorithm is unable to provide a tight
upper bound, the MILP’s computation time may not be
significantly improved.
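Why a tight upper bound helps can be seen in a toy example. The sketch below is not the COVAS implementation and does not use Gurobi: it minimizes a small assignment problem by depth-first branch-and-bound, and a greedy heuristic (a hypothetical stand-in for the apprenticeship scheduler) supplies the seed value. The cost matrix is made up for illustration.

```python
import math


def greedy_seed(cost):
    """Stand-in for a learned policy: assign each row its cheapest free
    column and return the total cost of that feasible assignment."""
    used, total = set(), 0.0
    for row in cost:
        col = min((c for c in range(len(row)) if c not in used),
                  key=lambda c: row[c])
        used.add(col)
        total += row[col]
    return total


def branch_and_bound(cost, seed_value=math.inf):
    """Minimize an n x n assignment cost.

    seed_value is the cost of a known feasible solution (upper bound).
    Returns (best_value, nodes_explored).
    """
    n = len(cost)
    # optimistic[i]: sum of row minima for rows i..n-1 (a lower bound).
    optimistic = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        optimistic[i] = optimistic[i + 1] + min(cost[i])
    best, nodes = seed_value, 0

    def visit(row, used, so_far):
        nonlocal best, nodes
        nodes += 1
        if so_far + optimistic[row] >= best:
            return  # cannot beat the incumbent: prune this subtree
        if row == n:
            best = so_far  # complete assignment that improves the incumbent
            return
        for col in range(n):
            if col not in used:
                visit(row + 1, used | {col}, so_far + cost[row][col])

    visit(0, frozenset(), 0.0)
    return best, nodes


cost = [[4, 2, 8], [4, 3, 7], [3, 1, 6]]
cold_value, cold_nodes = branch_and_bound(cost)
warm_value, warm_nodes = branch_and_bound(cost, seed_value=greedy_seed(cost))
```

Both runs return the same optimal value, but the seeded run can prune from the first node onward. A loose seed prunes little, which is the limitation noted above.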

In future work, we will explore extensions to the
apprenticeship scheduling algorithm to improve its ability to learn
from noisy demonstrations. One approach could be to
incorporate a trustworthiness metric à la (Zhang 2009) directly
into the training of the classifier to uncover a latent action
ranking. For example, instead of binary labels, we could
reformulate the problem to be one of regression, where
positive and negative labels are proportional and inversely
proportional, respectively, to the fidelity of the demonstrator.

We also aim to extend COVAS to a stochastic architecture
to reason about uncertainty over task assignment
characteristics (e.g., missile behavior) and temporal
dependencies (e.g., start and finish times).

Conclusions 

In this work, we developed and demonstrated an approach to
learning through human demonstrations to efficiently
produce optimal solutions for complex real-world scheduling
problems. We showed that policies learned from human
experts can be used in conjunction with a MILP solver to
substantially improve the efficiency of a branch-and-bound
search for an optimal schedule. We empirically validated our
technique on a dataset collected from human experts solving
an anti-ship missile defense problem, which represents the
hardest class of scheduling problems. We showed that our
approach can substantially improve upon solutions produced
by human domain experts, at a rate up to ~10 times faster
than an optimization approach that does not incorporate
human expert demonstration.